You are a Data Scientist for a tourism company named "Visit with us". The company's Policy Maker wants to establish a viable business model to expand the customer base.
A viable business model is a central concept here: it describes how the business currently operates and how those ways of working can be changed for the benefit of the tourism sector.
One of the ways to expand the customer base is to introduce a new offering of packages.
The company currently offers 5 types of packages - Basic, Standard, Deluxe, Super Deluxe, and King. Looking at last year's data, we observed that 18% of the customers purchased a package.
However, identifying potential customers was difficult because customers were contacted at random, without reference to the available information.
The company is now planning to launch a new product, the Wellness Tourism Package. Wellness tourism is defined as travel that allows the traveler to maintain, enhance, or kick-start a healthy lifestyle, and supports or increases one's sense of well-being.
This time, the company wants to harness the available data on existing and potential customers to target the right people. This will require analyzing the customers' data to provide recommendations to the Policy Maker, and building a model to predict which customers are likely to purchase the newly introduced travel package. The model will make its predictions before a customer is contacted.
To predict which customers are more likely to purchase the newly introduced travel package.
We can further categorize the provided variables as below, to help with EDA and other analyses.
Customer details can be split into demographic details and trip details. With that split, we have the following groups of data:
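As a working sketch of this split (the assignment of columns to groups is our own judgment call; the column names come from the data dictionary), the groups can be recorded as plain lists for reuse during EDA:

```python
# a possible grouping of the dataset's columns for EDA;
# which column belongs in which group is a judgment call
target = "ProdTaken"

demographic_cols = [
    "Age", "TypeofContact", "CityTier", "Occupation", "Gender",
    "MaritalStatus", "Passport", "OwnCar", "Designation", "MonthlyIncome",
]
trip_cols = [
    "NumberOfPersonVisiting", "NumberOfChildrenVisiting",
    "NumberOfTrips", "PreferredPropertyStar",
]
interaction_cols = [
    "DurationOfPitch", "NumberOfFollowups",
    "ProductPitched", "PitchSatisfactionScore",
]

# every non-ID, non-target column appears in exactly one group
all_cols = demographic_cols + trip_cols + interaction_cols
print(len(all_cols), len(set(all_cols)))  # → 18 18
```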
# to make Python code more structured
%load_ext nb_black
# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# for feature scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Library to split data
from sklearn.model_selection import train_test_split
# Libraries to import decision tree classifier and different ensemble classifiers
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
# Libraries to tune the model and get different metric scores
from sklearn import metrics
from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
)
from sklearn.model_selection import GridSearchCV
# library to import XGB classifier
from xgboost import XGBClassifier
Since the file is an Excel workbook, we will use pd.read_excel; because we pass a list to sheet_name, it returns a dictionary of DataFrames keyed by sheet name.
dict_tourism = pd.read_excel("Tourism.xlsx", sheet_name=["Tourism"])
data = dict_tourism.get("Tourism")
data.head()
| | CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200000 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3 | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | 1 | 0.0 | Manager | 20993.0 |
| 1 | 200001 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 3 | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | 1 | 2.0 | Manager | 20130.0 |
| 2 | 200002 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 3 | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0 | 0.0 | Executive | 17090.0 |
| 3 | 200003 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 2 | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | 1 | 1.0 | Executive | 17909.0 |
| 4 | 200004 | 0 | NaN | Self Enquiry | 1 | 8.0 | Small Business | Male | 2 | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | 1 | 0.0 | Executive | 18468.0 |
data.tail()
| | CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4883 | 204883 | 1 | 49.0 | Self Enquiry | 3 | 9.0 | Small Business | Male | 3 | 5.0 | Deluxe | 4.0 | Unmarried | 2.0 | 1 | 1 | 1 | 1.0 | Manager | 26576.0 |
| 4884 | 204884 | 1 | 28.0 | Company Invited | 1 | 31.0 | Salaried | Male | 4 | 5.0 | Basic | 3.0 | Single | 3.0 | 1 | 3 | 1 | 2.0 | Executive | 21212.0 |
| 4885 | 204885 | 1 | 52.0 | Self Enquiry | 3 | 17.0 | Salaried | Female | 4 | 4.0 | Standard | 4.0 | Married | 7.0 | 0 | 1 | 1 | 3.0 | Senior Manager | 31820.0 |
| 4886 | 204886 | 1 | 19.0 | Self Enquiry | 3 | 16.0 | Small Business | Male | 3 | 4.0 | Basic | 3.0 | Single | 3.0 | 0 | 5 | 0 | 2.0 | Executive | 20289.0 |
| 4887 | 204887 | 1 | 36.0 | Self Enquiry | 1 | 14.0 | Salaried | Male | 4 | 4.0 | Basic | 4.0 | Unmarried | 3.0 | 1 | 3 | 1 | 2.0 | Executive | 24041.0 |
data.shape
(4888, 20)
data.describe(include="all").T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CustomerID | 4888.0 | NaN | NaN | NaN | 202443.5 | 1411.188388 | 200000.0 | 201221.75 | 202443.5 | 203665.25 | 204887.0 |
| ProdTaken | 4888.0 | NaN | NaN | NaN | 0.188216 | 0.390925 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Age | 4662.0 | NaN | NaN | NaN | 37.622265 | 9.316387 | 18.0 | 31.0 | 36.0 | 44.0 | 61.0 |
| TypeofContact | 4863 | 2 | Self Enquiry | 3444 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CityTier | 4888.0 | NaN | NaN | NaN | 1.654255 | 0.916583 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 |
| DurationOfPitch | 4637.0 | NaN | NaN | NaN | 15.490835 | 8.519643 | 5.0 | 9.0 | 13.0 | 20.0 | 127.0 |
| Occupation | 4888 | 4 | Salaried | 2368 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Gender | 4888 | 3 | Male | 2916 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| NumberOfPersonVisiting | 4888.0 | NaN | NaN | NaN | 2.905074 | 0.724891 | 1.0 | 2.0 | 3.0 | 3.0 | 5.0 |
| NumberOfFollowups | 4843.0 | NaN | NaN | NaN | 3.708445 | 1.002509 | 1.0 | 3.0 | 4.0 | 4.0 | 6.0 |
| ProductPitched | 4888 | 5 | Basic | 1842 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| PreferredPropertyStar | 4862.0 | NaN | NaN | NaN | 3.581037 | 0.798009 | 3.0 | 3.0 | 3.0 | 4.0 | 5.0 |
| MaritalStatus | 4888 | 4 | Married | 2340 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| NumberOfTrips | 4748.0 | NaN | NaN | NaN | 3.236521 | 1.849019 | 1.0 | 2.0 | 3.0 | 4.0 | 22.0 |
| Passport | 4888.0 | NaN | NaN | NaN | 0.290917 | 0.454232 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| PitchSatisfactionScore | 4888.0 | NaN | NaN | NaN | 3.078151 | 1.365792 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| OwnCar | 4888.0 | NaN | NaN | NaN | 0.620295 | 0.485363 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| NumberOfChildrenVisiting | 4822.0 | NaN | NaN | NaN | 1.187267 | 0.857861 | 0.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| Designation | 4888 | 5 | Executive | 1842 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| MonthlyIncome | 4655.0 | NaN | NaN | NaN | 23619.853491 | 5380.698361 | 1000.0 | 20346.0 | 22347.0 | 25571.0 | 98678.0 |
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   CustomerID                4888 non-null   int64
 1   ProdTaken                 4888 non-null   int64
 2   Age                       4662 non-null   float64
 3   TypeofContact             4863 non-null   object
 4   CityTier                  4888 non-null   int64
 5   DurationOfPitch           4637 non-null   float64
 6   Occupation                4888 non-null   object
 7   Gender                    4888 non-null   object
 8   NumberOfPersonVisiting    4888 non-null   int64
 9   NumberOfFollowups         4843 non-null   float64
 10  ProductPitched            4888 non-null   object
 11  PreferredPropertyStar     4862 non-null   float64
 12  MaritalStatus             4888 non-null   object
 13  NumberOfTrips             4748 non-null   float64
 14  Passport                  4888 non-null   int64
 15  PitchSatisfactionScore    4888 non-null   int64
 16  OwnCar                    4888 non-null   int64
 17  NumberOfChildrenVisiting  4822 non-null   float64
 18  Designation               4888 non-null   object
 19  MonthlyIncome             4655 non-null   float64
dtypes: float64(7), int64(7), object(6)
memory usage: 763.9+ KB
Since CustomerID is a unique value that identifies each customer, we can drop this column as it is not needed for our analyses.
data.drop("CustomerID", axis=1, inplace=True)
Since we have dropped CustomerID, now is the right time to check for duplicated rows: the unique CustomerID values would otherwise have prevented us from finding true duplicates.
# print the number of duplicated rows
print("Number of duplicated rows: ", data[data.duplicated()].shape[0])
# let's isolate the duplicated rows and verify they are indeed duplicates before dropping them
# keep = False will get all the duplicate items without eliminating duplicate rows.
# we can sort by the Monthly Income in order to isolate any two rows which are duplicates and compare
data[data.duplicated(keep=False)].sort_values("MonthlyIncome").head(10)
Number of duplicated rows: 141
| | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1971 | 0 | 30.0 | Company Invited | 1 | 9.0 | Small Business | Female | 3 | 3.0 | Basic | 3.0 | Married | 2.0 | 0 | 3 | 1 | 1.0 | Executive | 17083.0 |
| 501 | 0 | 30.0 | Company Invited | 1 | 9.0 | Small Business | Female | 3 | 3.0 | Basic | 3.0 | Married | 2.0 | 0 | 3 | 1 | 1.0 | Executive | 17083.0 |
| 1642 | 0 | 36.0 | Company Invited | 1 | 9.0 | Small Business | Male | 1 | 3.0 | Basic | 4.0 | Single | 5.0 | 0 | 4 | 1 | 0.0 | Executive | 17088.0 |
| 172 | 0 | 36.0 | Company Invited | 1 | 9.0 | Small Business | Male | 1 | 3.0 | Basic | 4.0 | Single | 5.0 | 0 | 4 | 1 | 0.0 | Executive | 17088.0 |
| 1732 | 0 | 32.0 | Self Enquiry | 1 | 8.0 | Large Business | Male | 2 | 4.0 | Deluxe | 5.0 | Single | 5.0 | 0 | 4 | 0 | 1.0 | Manager | 17176.0 |
| 262 | 0 | 32.0 | Self Enquiry | 1 | 8.0 | Large Business | Male | 2 | 4.0 | Deluxe | 5.0 | Single | 5.0 | 0 | 4 | 0 | 1.0 | Manager | 17176.0 |
| 2125 | 0 | 33.0 | Self Enquiry | 2 | 9.0 | Salaried | Male | 2 | 3.0 | Basic | 4.0 | Married | 4.0 | 1 | 5 | 0 | 1.0 | Executive | 17277.0 |
| 655 | 0 | 33.0 | Self Enquiry | 2 | 9.0 | Salaried | Male | 2 | 3.0 | Basic | 4.0 | Married | 4.0 | 1 | 5 | 0 | 1.0 | Executive | 17277.0 |
| 153 | 0 | 45.0 | Self Enquiry | 1 | 15.0 | Salaried | Male | 2 | 3.0 | Deluxe | 4.0 | Married | 1.0 | 0 | 4 | 1 | 0.0 | Manager | 17348.0 |
| 1623 | 0 | 45.0 | Self Enquiry | 1 | 15.0 | Salaried | Male | 2 | 3.0 | Deluxe | 4.0 | Married | 1.0 | 0 | 4 | 1 | 0.0 | Manager | 17348.0 |
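The `keep` parameter semantics matter here. A minimal sketch on a toy frame (not the tourism data) shows the difference:

```python
import pandas as pd

toy = pd.DataFrame({"x": [1, 1, 2]})

# default keep="first" marks only the later copies as duplicates
print(toy.duplicated().tolist())            # → [False, True, False]

# keep=False marks every member of a duplicated group,
# which is what lets us compare both copies side by side
print(toy.duplicated(keep=False).tolist())  # → [True, True, False]
```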
# we have verified the duplicates and it would be best to drop such rows
data.drop_duplicates(inplace=True)
# number of rows remaining in the dataset after duplicate removal
data.shape[0]
4747
# verifying after dropping
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4747 entries, 0 to 4887
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   ProdTaken                 4747 non-null   int64
 1   Age                       4531 non-null   float64
 2   TypeofContact             4722 non-null   object
 3   CityTier                  4747 non-null   int64
 4   DurationOfPitch           4501 non-null   float64
 5   Occupation                4747 non-null   object
 6   Gender                    4747 non-null   object
 7   NumberOfPersonVisiting    4747 non-null   int64
 8   NumberOfFollowups         4703 non-null   float64
 9   ProductPitched            4747 non-null   object
 10  PreferredPropertyStar     4721 non-null   float64
 11  MaritalStatus             4747 non-null   object
 12  NumberOfTrips             4609 non-null   float64
 13  Passport                  4747 non-null   int64
 14  PitchSatisfactionScore    4747 non-null   int64
 15  OwnCar                    4747 non-null   int64
 16  NumberOfChildrenVisiting  4687 non-null   float64
 17  Designation               4747 non-null   object
 18  MonthlyIncome             4523 non-null   float64
dtypes: float64(7), int64(6), object(6)
memory usage: 741.7+ KB
# checking which columns have null values
data.isnull().sum().sort_values(ascending=False)
DurationOfPitch             246
MonthlyIncome               224
Age                         216
NumberOfTrips               138
NumberOfChildrenVisiting     60
NumberOfFollowups            44
PreferredPropertyStar        26
TypeofContact                25
Designation                   0
OwnCar                        0
PitchSatisfactionScore        0
Passport                      0
ProdTaken                     0
MaritalStatus                 0
NumberOfPersonVisiting        0
Gender                        0
Occupation                    0
CityTier                      0
ProductPitched                0
dtype: int64
In order to impute missing values, let's see if there are any meaningful correlations between variables with missing values and those without.
# plotting the correlation heatmap
# numeric_only=True restricts corr() to the numeric columns
# (required in pandas >= 2.0, where object columns are no longer dropped silently)
plt.figure(figsize=(10, 7))
sns.heatmap(
    data.corr(numeric_only=True),
    annot=True, vmin=-1, vmax=1, fmt=".1g", cmap="Spectral", cbar=False,
)
plt.show()
Given that variables do not have very strong correlations, we can take a few measures to impute missing values.
# groupby Designation and check the null values in MonthlyIncome
data[data["MonthlyIncome"].isnull()].groupby(["Designation"]).size()
Designation
Executive     82
Manager      142
dtype: int64
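The group-wise imputation in the next cell relies on `transform` returning a Series aligned to the original index, so the group median fills only the missing slots within each group. A toy sketch of the pattern (toy data, not the tourism set):

```python
import pandas as pd

toy = pd.DataFrame(
    {"grp": ["A", "A", "B", "B"], "val": [10.0, None, 1.0, 3.0]}
)

# transform keeps the original index; the median of group A (10.0)
# fills only the missing entry in group A, and group B is untouched
toy["val"] = toy.groupby("grp")["val"].transform(lambda s: s.fillna(s.median()))
print(toy["val"].tolist())  # → [10.0, 10.0, 1.0, 3.0]
```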
# impute with median of the group
data["MonthlyIncome"] = data.groupby(["Designation"])["MonthlyIncome"].transform(
lambda x: x.fillna(x.median())
)
# verify after imputing
data[data["MonthlyIncome"].isnull()]
Empty DataFrame (0 rows) — no missing values remain in MonthlyIncome.
# impute with median of Age after grouping by Designation
data["Age"] = data.groupby(["Designation"])["Age"].transform(
lambda x: x.fillna(x.median())
)
# verify after imputing
data[data["Age"].isnull()]
Empty DataFrame (0 rows) — no missing values remain in Age.
# impute with median of DurationOfPitch after grouping by ProductPitched and ProdTaken
data["DurationOfPitch"] = data.groupby(["ProductPitched", "ProdTaken"])[
"DurationOfPitch"
].transform(lambda x: x.fillna(x.median()))
# verify after imputing
data[data["DurationOfPitch"].isnull()]
Empty DataFrame (0 rows) — no missing values remain in DurationOfPitch.
# checking the counts of values in TypeofContact
data["TypeofContact"].value_counts()
Self Enquiry       3350
Company Invited    1372
Name: TypeofContact, dtype: int64
# impute with frequently occurring value
data["TypeofContact"] = data["TypeofContact"].fillna("Self Enquiry")
# verify after imputing
data[data["TypeofContact"].isnull()]
Empty DataFrame (0 rows) — no missing values remain in TypeofContact.
# impute with median of NumberOfChildrenVisiting after grouping by NumberOfPersonVisiting
data["NumberOfChildrenVisiting"] = data.groupby(["NumberOfPersonVisiting"])[
"NumberOfChildrenVisiting"
].transform(lambda x: x.fillna(x.median()))
# verify after imputing
data[data["NumberOfChildrenVisiting"].isnull()]
Empty DataFrame (0 rows) — no missing values remain in NumberOfChildrenVisiting.
# impute with mode of PreferredPropertyStar
data["PreferredPropertyStar"] = data["PreferredPropertyStar"].fillna(
data["PreferredPropertyStar"].mode().iloc[0]
)
# verify after imputing
data[data["PreferredPropertyStar"].isnull()]
Empty DataFrame (0 rows) — no missing values remain in PreferredPropertyStar.
# creating a list of numerical columns remaining for missing value treatment
missing_numerical = ["NumberOfTrips", "NumberOfFollowups"]
# function for replacing with the Median value of the attributes
medianFiller = lambda x: x.fillna(x.median())
# apply the function
data[missing_numerical] = data[missing_numerical].apply(medianFiller, axis=0)
# verify all columns after null value treatment
data.isnull().sum().sort_values(ascending=False)
ProdTaken                   0
PreferredPropertyStar       0
Designation                 0
NumberOfChildrenVisiting    0
OwnCar                      0
PitchSatisfactionScore      0
Passport                    0
NumberOfTrips               0
MaritalStatus               0
ProductPitched              0
Age                         0
NumberOfFollowups           0
NumberOfPersonVisiting      0
Gender                      0
Occupation                  0
DurationOfPitch             0
CityTier                    0
TypeofContact               0
dtype: int64
Let's look at the unique values in the object columns to check for any irregularities we need to correct.
# select the object and category columns
obj_cols = data.select_dtypes(["object", "category"])
# get the valuecounts
for i in obj_cols:
    print("Total of", i, ": ", obj_cols[i].count())
    print(obj_cols[i].value_counts())
    print("-" * 50)
    print("\n")
Total of TypeofContact :  4747
Self Enquiry       3375
Company Invited    1372
Name: TypeofContact, dtype: int64
--------------------------------------------------

Total of Occupation :  4747
Salaried          2293
Small Business    2028
Large Business     424
Free Lancer          2
Name: Occupation, dtype: int64
--------------------------------------------------

Total of Gender :  4747
Male       2835
Female     1769
Fe Male     143
Name: Gender, dtype: int64
--------------------------------------------------

Total of ProductPitched :  4747
Basic           1800
Deluxe          1684
Standard         714
Super Deluxe     324
King             225
Name: ProductPitched, dtype: int64
--------------------------------------------------

Total of MaritalStatus :  4747
Married      2279
Divorced      950
Single        875
Unmarried     643
Name: MaritalStatus, dtype: int64
--------------------------------------------------

Total of Designation :  4747
Executive         1800
Manager           1684
Senior Manager     714
AVP                324
VP                 225
Name: Designation, dtype: int64
--------------------------------------------------
# treating the data entry error in Gender
data.Gender = data.Gender.replace("Fe Male", "Female")
# verify the update
data.Gender.value_counts()
Male      2835
Female    1912
Name: Gender, dtype: int64
Now, let's take a summary of the dataset again.
data.describe(include="all").T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ProdTaken | 4747.0 | NaN | NaN | NaN | 0.188329 | 0.391016 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Age | 4747.0 | NaN | NaN | NaN | 37.396882 | 9.164235 | 18.0 | 31.0 | 36.0 | 43.0 | 61.0 |
| TypeofContact | 4747 | 2 | Self Enquiry | 3375 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CityTier | 4747.0 | NaN | NaN | NaN | 1.655151 | 0.917416 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 |
| DurationOfPitch | 4747.0 | NaN | NaN | NaN | 15.400463 | 8.327546 | 5.0 | 9.0 | 13.0 | 19.0 | 127.0 |
| Occupation | 4747 | 4 | Salaried | 2293 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Gender | 4747 | 2 | Male | 2835 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| NumberOfPersonVisiting | 4747.0 | NaN | NaN | NaN | 2.911734 | 0.72404 | 1.0 | 2.0 | 3.0 | 3.0 | 5.0 |
| NumberOfFollowups | 4747.0 | NaN | NaN | NaN | 3.707815 | 1.004388 | 1.0 | 3.0 | 4.0 | 4.0 | 6.0 |
| ProductPitched | 4747 | 5 | Basic | 1800 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| PreferredPropertyStar | 4747.0 | NaN | NaN | NaN | 3.580156 | 0.799316 | 3.0 | 3.0 | 3.0 | 4.0 | 5.0 |
| MaritalStatus | 4747 | 4 | Married | 2279 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| NumberOfTrips | 4747.0 | NaN | NaN | NaN | 3.226459 | 1.82121 | 1.0 | 2.0 | 3.0 | 4.0 | 22.0 |
| Passport | 4747.0 | NaN | NaN | NaN | 0.289657 | 0.453651 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| PitchSatisfactionScore | 4747.0 | NaN | NaN | NaN | 3.051612 | 1.369584 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| OwnCar | 4747.0 | NaN | NaN | NaN | 0.617653 | 0.486012 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| NumberOfChildrenVisiting | 4747.0 | NaN | NaN | NaN | 1.194228 | 0.856413 | 0.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| Designation | 4747 | 5 | Executive | 1800 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| MonthlyIncome | 4747.0 | NaN | NaN | NaN | 23532.024226 | 5271.613507 | 1000.0 | 20474.5 | 22400.0 | 25389.0 | 98678.0 |
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4747 entries, 0 to 4887
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   ProdTaken                 4747 non-null   int64
 1   Age                       4747 non-null   float64
 2   TypeofContact             4747 non-null   object
 3   CityTier                  4747 non-null   int64
 4   DurationOfPitch           4747 non-null   float64
 5   Occupation                4747 non-null   object
 6   Gender                    4747 non-null   object
 7   NumberOfPersonVisiting    4747 non-null   int64
 8   NumberOfFollowups         4747 non-null   float64
 9   ProductPitched            4747 non-null   object
 10  PreferredPropertyStar     4747 non-null   float64
 11  MaritalStatus             4747 non-null   object
 12  NumberOfTrips             4747 non-null   float64
 13  Passport                  4747 non-null   int64
 14  PitchSatisfactionScore    4747 non-null   int64
 15  OwnCar                    4747 non-null   int64
 16  NumberOfChildrenVisiting  4747 non-null   float64
 17  Designation               4747 non-null   object
 18  MonthlyIncome             4747 non-null   float64
dtypes: float64(7), int64(6), object(6)
memory usage: 741.7+ KB
Looking at the info summary of the dataset, some of the variables have an inappropriate datatype. Let's re-classify the variables based on their nature and, where needed, transform them to the appropriate datatypes. The variables in the dataset can be categorized in the following manner:
# creating a list of columns to convert to the category datatype
cat_cols_transform = [
"TypeofContact",
"Occupation",
"Gender",
"ProductPitched",
"PreferredPropertyStar",
"MaritalStatus",
"Passport",
"Designation",
"OwnCar",
"ProdTaken",
]
data[cat_cols_transform] = data[cat_cols_transform].astype("category")
# check the dataset for updated datatypes
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4747 entries, 0 to 4887
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   ProdTaken                 4747 non-null   category
 1   Age                       4747 non-null   float64
 2   TypeofContact             4747 non-null   category
 3   CityTier                  4747 non-null   int64
 4   DurationOfPitch           4747 non-null   float64
 5   Occupation                4747 non-null   category
 6   Gender                    4747 non-null   category
 7   NumberOfPersonVisiting    4747 non-null   int64
 8   NumberOfFollowups         4747 non-null   float64
 9   ProductPitched            4747 non-null   category
 10  PreferredPropertyStar     4747 non-null   category
 11  MaritalStatus             4747 non-null   category
 12  NumberOfTrips             4747 non-null   float64
 13  Passport                  4747 non-null   category
 14  PitchSatisfactionScore    4747 non-null   int64
 15  OwnCar                    4747 non-null   category
 16  NumberOfChildrenVisiting  4747 non-null   float64
 17  Designation               4747 non-null   category
 18  MonthlyIncome             4747 non-null   float64
dtypes: category(10), float64(6), int64(3)
memory usage: 418.8 KB
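The drop in memory usage reported by info() after the conversion comes from the category dtype storing small integer codes per row plus a single copy of each label. A small sketch, independent of our dataset:

```python
import pandas as pd

# a long Series of repeated strings, similar in spirit to Designation
s_obj = pd.Series(["Executive", "Manager"] * 2000)
s_cat = s_obj.astype("category")

obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)

# the categorical copy stores one integer code per row
# instead of one Python string object per row
print(cat_bytes < obj_bytes)  # → True
```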
Let's count the values in the categorical columns again and see what they contain.
# select the category-type columns
cat_cols = data.select_dtypes(["category"])
# get the valuecounts
for i in cat_cols:
    print(cat_cols[i].value_counts())
    print("-" * 50)
    print("\n")
0    3853
1     894
Name: ProdTaken, dtype: int64
--------------------------------------------------

Self Enquiry       3375
Company Invited    1372
Name: TypeofContact, dtype: int64
--------------------------------------------------

Salaried          2293
Small Business    2028
Large Business     424
Free Lancer          2
Name: Occupation, dtype: int64
--------------------------------------------------

Male      2835
Female    1912
Name: Gender, dtype: int64
--------------------------------------------------

Basic           1800
Deluxe          1684
Standard         714
Super Deluxe     324
King             225
Name: ProductPitched, dtype: int64
--------------------------------------------------

3.0    2931
5.0     938
4.0     878
Name: PreferredPropertyStar, dtype: int64
--------------------------------------------------

Married      2279
Divorced      950
Single        875
Unmarried     643
Name: MaritalStatus, dtype: int64
--------------------------------------------------

0    3372
1    1375
Name: Passport, dtype: int64
--------------------------------------------------

1    2932
0    1815
Name: OwnCar, dtype: int64
--------------------------------------------------

Executive         1800
Manager           1684
Senior Manager     714
AVP                324
VP                 225
Name: Designation, dtype: int64
--------------------------------------------------
To support customer profiling, let's bin the data in some of the columns:
# defining bins for Age
bins_age = (15, 30, 45, 60, 75)
# defining labels
labels_age = ["15-30", "30-45", "45-60", "60-75"]
data["age_bin"] = pd.cut(x=data["Age"], bins=bins_age, labels=labels_age)
# defining bins for DurationOfPitch
bins_dur = (0, 10, 50, 100, 200)
# defining labels
labels_dur = ["Short", "Regular", "Medium", "Long"]
data["duration_bin"] = pd.cut(
x=data["DurationOfPitch"], bins=bins_dur, labels=labels_dur
)
# defining bins for MonthlyIncome
bins_inc = (500, 10000, 20000, 60000, 100000)
# defining labels
labels_inc = ["<10K", "10K-20K", "20K-60K", "60K-100K"]
data["income_bin"] = pd.cut(x=data["MonthlyIncome"], bins=bins_inc, labels=labels_inc)
# defining bins for NumberOfTrips
bins_trip = (0, 5, 10, 20, 30)
# defining labels
labels_trip = ["<5", "5-10", "10-20", "20-30"]
data["trips_bin"] = pd.cut(x=data["NumberOfTrips"], bins=bins_trip, labels=labels_trip)
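One subtlety worth keeping in mind: pd.cut is right-inclusive by default, so a boundary value such as an age of exactly 30 falls into the lower bin. A minimal sketch:

```python
import pandas as pd

ages = pd.Series([30, 31, 45])
binned = pd.cut(
    ages, bins=(15, 30, 45, 60, 75), labels=["15-30", "30-45", "45-60", "60-75"]
)

# intervals are (15, 30], (30, 45], ...: 30 lands in "15-30",
# while 31 and 45 both land in "30-45"
print(binned.tolist())  # → ['15-30', '30-45', '30-45']
```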
# building different plots to understand the data in Age column
# plotting countplot for the binned Age column
plt.figure(figsize=(5, 5))
ax = sns.countplot(data=data, x="age_bin", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# plotting histplot and boxplot for Age
plt.figure(figsize=(5, 5))
ax = sns.histplot(data=data, x="Age", kde=True)
plt.show()
ax = sns.boxplot(data=data, x="Age", palette="viridis")
plt.show()
# building different plots to understand the data in MonthlyIncome column
# plotting countplot for the MonthlyIncome column
plt.figure(figsize=(7, 5))
ax = sns.countplot(data=data, x="income_bin", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# plotting histplot and boxplot for MonthlyIncome
plt.figure(figsize=(12, 5))
ax = sns.histplot(data=data, x="MonthlyIncome", kde=True)
plt.show()
plt.figure(figsize=(12, 5))
ax = sns.boxplot(data=data, x="MonthlyIncome", palette="viridis")
plt.show()
# plotting countplot for the ProdTaken column
plt.figure(figsize=(5, 5))
ax = sns.countplot(data=data, x="ProdTaken", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# plotting countplot for the TypeofContact column
plt.figure(figsize=(5, 5))
ax = sns.countplot(data=data, x="TypeofContact", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# plotting countplot for the CityTier column
plt.figure(figsize=(5, 5))
ax = sns.countplot(data=data, x="CityTier", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# plotting countplot for the Occupation column
plt.figure(figsize=(5, 5))
ax = sns.countplot(data=data, x="Occupation", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# plotting countplot for the Gender column
plt.figure(figsize=(5, 5))
ax = sns.countplot(data=data, x="Gender", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# plotting countplot for the MaritalStatus column
plt.figure(figsize=(5, 5))
ax = sns.countplot(data=data, x="MaritalStatus", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# plotting countplot for the binned NumberOfTrips column
plt.figure(figsize=(5, 5))
ax = sns.countplot(data=data, x="trips_bin", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# plotting countplot and boxplot for NumberOfTrips
plt.figure(figsize=(7, 5))
ax = sns.countplot(data=data, x="NumberOfTrips")
plt.show()
plt.figure(figsize=(7, 5))
ax = sns.boxplot(data=data, x="NumberOfTrips", palette="viridis")
plt.show()
# plotting countplot for the Passport column
plt.figure(figsize=(4, 4))
ax = sns.countplot(data=data, x="Passport", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# plotting countplot for the OwnCar column
plt.figure(figsize=(4, 4))
ax = sns.countplot(data=data, x="OwnCar", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# plotting countplot for the NumberOfChildrenVisiting column
plt.figure(figsize=(4, 4))
ax = sns.countplot(data=data, x="NumberOfChildrenVisiting", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# plotting countplot for the NumberOfPersonVisiting column
plt.figure(figsize=(5, 5))
ax = sns.countplot(data=data, x="NumberOfPersonVisiting", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# plotting countplot for the Designation column
plt.figure(figsize=(5, 5))
ax = sns.countplot(data=data, x="Designation", palette="viridis")
ax.bar_label(ax.containers[0])
plt.xticks(rotation=45)
plt.show()
# plotting countplot for the PreferredPropertyStar column
plt.figure(figsize=(4, 4))
ax = sns.countplot(data=data, x="PreferredPropertyStar", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# plotting countplot for the binned DurationOfPitch column
plt.figure(figsize=(5, 5))
ax = sns.countplot(data=data, x="duration_bin", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# plotting histplot and boxplot for DurationOfPitch
plt.figure(figsize=(7, 5))
ax = sns.histplot(data=data, x="DurationOfPitch", kde=True)
plt.show()
plt.figure(figsize=(7, 5))
ax = sns.boxplot(data=data, x="DurationOfPitch", palette="viridis")
plt.show()
# plotting countplot for NumberOfFollowups
plt.figure(figsize=(7, 5))
ax = sns.countplot(data=data, x="NumberOfFollowups")
ax.bar_label(ax.containers[0])
plt.show()
# plotting countplot for PitchSatisfactionScore
plt.figure(figsize=(7, 5))
ax = sns.countplot(data=data, x="PitchSatisfactionScore")
ax.bar_label(ax.containers[0])
plt.show()
# plotting pairplot between the different variables to understand the nature of correlations
sns.pairplot(data, hue="ProdTaken")
plt.show()
# plotting countplot for the ProductPitched column where ProdTaken=1
plt.figure(figsize=(5, 5))
ax = sns.countplot(
data=data[data["ProdTaken"] == 1], x="ProductPitched", palette="viridis"
)
ax.bar_label(ax.containers[0])
plt.xticks(rotation=45)
plt.show()
# boxplot of numerical columns vs the ProdTaken
cols = data[
["Age", "MonthlyIncome", "NumberOfTrips", "DurationOfPitch", "NumberOfFollowups"]
].columns.tolist()
plt.figure(figsize=(12, 6))
for i, variable in enumerate(cols):
    plt.subplot(1, 5, i + 1)
    # seaborn >= 0.12 requires x and y as keyword arguments
    sns.boxplot(x=data["ProdTaken"], y=data[variable], palette="viridis")
    plt.tight_layout()
    plt.xticks(rotation=45)
    plt.title(variable)
plt.show()
# countplots for the interval variables vs the ProdTaken
cat = data[["age_bin", "income_bin", "duration_bin"]].columns.tolist()
plt.figure(figsize=(12, 5))
for i, variable in enumerate(cat):
    plt.subplot(1, 3, i + 1)
    sns.countplot(data=data, hue="ProdTaken", x=data[variable], palette="viridis")
    plt.tight_layout()
    plt.title(variable)
    plt.xticks(rotation=45)
    plt.legend(loc="upper right")
plt.show()
# understanding different relationships between demographic variables and ProdTaken
demographic = data[
[
"TypeofContact",
"CityTier",
"Occupation",
"Gender",
"MaritalStatus",
"Passport",
"OwnCar",
"Designation",
]
].columns.tolist()
plt.figure(figsize=(15, 8))
for i, variable in enumerate(demographic):
plt.subplot(2, 4, i + 1)
sns.countplot(data=data, hue="ProdTaken", x=data[variable], palette="viridis")
plt.tight_layout()
plt.title(variable)
plt.xticks(rotation=45)
plt.legend(loc="upper right")
plt.show()
# understanding relationship between interaction variables and ProdTaken
interaction = data[
["duration_bin", "ProductPitched", "PitchSatisfactionScore", "NumberOfFollowups",]
].columns.tolist()
plt.figure(figsize=(15, 5))
for i, variable in enumerate(interaction):
plt.subplot(1, 4, i + 1)
sns.countplot(data=data, hue="ProdTaken", x=data[variable], palette="viridis")
plt.tight_layout()
plt.title(variable)
plt.xticks(rotation=45)
plt.legend(loc="upper right")
plt.show()
# understanding relationship between trip related variables and ProdTaken
# omitting the PreferredPropertyStar variable since most customers chose 3 as the preferred rating
trip = data[
["NumberOfPersonVisiting", "trips_bin", "NumberOfChildrenVisiting",]
].columns.tolist()
plt.figure(figsize=(15, 5))
for i, variable in enumerate(trip):
plt.subplot(1, 3, i + 1)
sns.countplot(data=data, hue="ProdTaken", x=data[variable], palette="viridis")
plt.tight_layout()
plt.title(variable)
plt.xticks(rotation=45)
plt.legend(loc="upper right")
plt.show()
# countplots for trip related variables vs ProdTaken
trip = data[
["NumberOfPersonVisiting", "NumberOfTrips", "NumberOfChildrenVisiting",]
].columns.tolist()
for i, variable in enumerate(trip):
plt.figure(figsize=(15, 5))
plt.subplot(1, 3, i + 1)
sns.countplot(
data=data[data["ProdTaken"] == 1], x=data[variable], palette="viridis",
)
plt.tight_layout()
plt.show()
# let's see what is the relationship between Designation and PreferredPropertyStar
plt.figure(figsize=(7, 5))
sns.countplot(
data=data[data["ProdTaken"] == 1],
x="Designation",
hue="PreferredPropertyStar",
palette="viridis",
)
plt.tight_layout()
plt.show()
# observe the correlation between numerical features
plt.figure(figsize=(7, 5))
sns.heatmap(
data.corr(), annot=True, vmin=-1, vmax=1, fmt=".1g", cmap="Spectral", cbar=False
)
plt.show()
In this section, let us try to build a customer profile from the data we have. We will take customers who purchased a product (ProdTaken == 1) and see what we can derive by analyzing the different packages they purchased.
# isolate the rows where ProdTaken = 1,
Prod_taken = data[data["ProdTaken"] == 1]
Prod_taken.shape
(894, 23)
There are 894 rows where a purchase was made. We know from the previous EDA how purchases are distributed across packages; let us resurface that graph using the Prod_taken dataframe.
# plotting countplot for the ProductPitched column where ProdTaken=1
plt.figure(figsize=(5, 5))
ax = sns.countplot(data=Prod_taken, x="ProductPitched", palette="viridis")
ax.bar_label(ax.containers[0])
plt.xticks(rotation=45)
plt.show()
# profiling the ProductPitched against the different demographic details
plt.figure(figsize=(15, 20))
for i, variable in enumerate(demographic):
plt.subplot(4, 2, i + 1)
sns.countplot(
data=Prod_taken,
x="ProductPitched",
hue=Prod_taken[variable],
palette="viridis",
)
plt.tight_layout()
plt.title(variable)
plt.xticks(rotation=45)
plt.legend(loc="upper right")
plt.show()
# profiling the ProductPitched against the remaining interaction variables
interaction_new = ["duration_bin", "PitchSatisfactionScore", "NumberOfFollowups"]
plt.figure(figsize=(18, 5))
for i, variable in enumerate(interaction_new):
plt.subplot(1, 3, i + 1)
sns.countplot(
data=Prod_taken, x="ProductPitched", hue=Prod_taken[variable],
)
plt.tight_layout()
plt.title(variable)
plt.xticks(rotation=45)
plt.show()
# profiling the ProductPitched against the trip variables
trip_profile = [
"NumberOfPersonVisiting",
"trips_bin",
"NumberOfChildrenVisiting",
"PreferredPropertyStar",
]
plt.figure(figsize=(15, 10))
for i, variable in enumerate(trip_profile):
plt.subplot(2, 2, i + 1)
sns.countplot(
data=Prod_taken,
x="ProductPitched",
hue=Prod_taken[variable],
palette="viridis",
)
plt.tight_layout()
plt.title(variable)
plt.legend(loc="upper right")
plt.show()
# profiling the ProductPitched against the binned income variable
sns.countplot(
hue="income_bin", x="ProductPitched", data=Prod_taken, palette="viridis"
).set_title("Income vs Product")
plt.legend(bbox_to_anchor=(1, 1))
plt.show()
# profiling the ProductPitched against binned age variable
sns.countplot(
hue="age_bin", x="ProductPitched", data=Prod_taken, palette="viridis"
).set_title("Age vs Product")
plt.legend(bbox_to_anchor=(1, 1))
plt.show()
# let's collect the statistical summary of the income and age columns in a separate dataframe
summary_purchase = pd.DataFrame(
Prod_taken.groupby(["ProductPitched"]).agg(
{"MonthlyIncome": {"mean", "min", "max"}, "Age": {"mean", "min", "max"}}
)
)
# define a function to highlight rows in a dataframe
def highlight_rows(df):
return ["background-color: lightgray" if i % 2 == 0 else "" for i in range(len(df))]
# applying the function to the dataframe
summary_purchase.style.apply(highlight_rows, axis=0)
| ProductPitched | MonthlyIncome mean | MonthlyIncome min | MonthlyIncome max | Age mean | Age min | Age max |
|---|---|---|---|---|---|---|
| Basic | 20186.84 | 16009 | 37868 | 31.29 | 18 | 59 |
| Deluxe | 23094.72 | 17086 | 38525 | 37.64 | 21 | 59 |
| King | 34672.10 | 17517 | 38537 | 48.90 | 27 | 59 |
| Standard | 26016.43 | 17372 | 38395 | 41.19 | 19 | 60 |
| Super Deluxe | 29829.13 | 21151 | 37502 | 44.13 | 39 | 56 |
Note: Interaction parameters are more or less the same across the products. Most products have a pitch duration of short to regular. Customer satisfaction is mostly average, though scores range anywhere from 1 to 5. Most products require 3-4 followups, except Super Deluxe, which requires only 1-2.
In the previous EDA, we have seen outliers in the DurationOfPitch, NumberOfFollowups, MonthlyIncome, and NumberOfTrips columns.
Since Duration of Pitch and Number of Followups are customer interaction data, we assume they will not be available for future potential customers. Therefore, we will only check for outliers in Monthly Income and Number of Trips.
numerical_col = ["MonthlyIncome", "NumberOfTrips"]
plt.figure(figsize=(15, 5))
for i, variable in enumerate(numerical_col):
plt.subplot(1, 2, i + 1)
plt.boxplot(data[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
# calculating the IQR-based whisker values beyond which outliers exist
for col in numerical_col:
quartiles = np.quantile(data[col], [0.25, 0.75])
Q1 = quartiles[0]
Q3 = quartiles[1]
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
if lower > 0:
print("Outlier values of", col, "are lesser than", f"{lower}")
if upper > 0:
print("Outlier values of", col, "are greater than", f"{upper}")
print("\n")
Outlier values of MonthlyIncome are lesser than 13102.75
Outlier values of MonthlyIncome are greater than 32760.75
Outlier values of NumberOfTrips are greater than 7.0
We can understand the outliers in Monthly Income by reading them in relation to Designation and Age: as Age increases or Designation gets higher, income also tends to be high. We can check whether this relationship is violated and take action if it is. We can also bring in Occupation to derive further insights.
outlier_income = data[(data.MonthlyIncome < 13102.75) | (data.MonthlyIncome > 32760.75)]
outlier_income.shape
(366, 23)
# isolating all rows lesser than lower whisker
count_lower = outlier_income[outlier_income["MonthlyIncome"] < 13102.75]
print("Monthly Incomes less than 13,102")
print(count_lower["MonthlyIncome"], "\n")
# Creating crosstabs between Designation and age_bins
crosstb1 = pd.crosstab(count_lower.Designation, count_lower.age_bin)
# Creating barplot
barplot = crosstb1.plot.bar(rot=0)
plt.show()
Monthly Incomes less than 13,102
142     1000.0
2586    4678.0
Name: MonthlyIncome, dtype: float64
# isolating all rows greater than upper whisker
count_upper = outlier_income[outlier_income["MonthlyIncome"] > 32760.75]
print("Number of MonthlyIncomes greater than 32,760: ", count_upper.shape[0])
# Creating crosstabs between Designation and age_bins, Designation and Occupation
# Creating barplot to display the cross tabs
crosstb2 = pd.crosstab(count_upper.Designation, count_upper.Occupation)
barplot = crosstb2.plot.bar(rot=0)
plt.title("Designation vs Occupation")
plt.tight_layout()
crosstb3 = pd.crosstab(count_upper.Designation, count_upper.age_bin)
barplot = crosstb3.plot.bar(rot=0)
plt.title("Designation vs Age")
plt.tight_layout()
plt.show()
Number of MonthlyIncomes greater than 32,760: 364
# let's print the top 5 incomes for each observed designation to understand their ranges
count_upper = count_upper.sort_values(by=["Designation", "MonthlyIncome"], ascending=[True, False])
count_upper_grouped = count_upper.groupby(by='Designation')
for name, group in count_upper_grouped:
print("Designation:", name)
print(group.head(5)[['MonthlyIncome']])
print("\n")
Designation: AVP
MonthlyIncome
4827 37502.0
3097 36602.0
4567 36602.0
3818 36553.0
2873 36539.0
Designation: Executive
MonthlyIncome
2482 98678.0
38 95000.0
4836 37868.0
4869 37865.0
4821 36891.0
Designation: Manager
MonthlyIncome
4832 38525.0
4830 37467.0
4850 36739.0
4859 35558.0
2388 34847.0
Designation: Senior Manager
MonthlyIncome
4870 38395.0
4818 36943.0
2426 34717.0
2374 33265.0
Designation: VP
MonthlyIncome
2634 38677.0
4104 38677.0
3190 38651.0
4660 38651.0
3295 38621.0
The following MonthlyIncome values stand out as erroneous: 1000 and 4678 (far below plausible salaries for the corresponding designations) and 95000 and 98678 (far above the rest of the Executive range).
# dropping off the rows with values we identified as erroneous
data.drop(
data[
(data["MonthlyIncome"] == 1000)
| (data["MonthlyIncome"] == 4678)
| (data["MonthlyIncome"] == 95000)
| (data["MonthlyIncome"] == 98678)
].index,
inplace=True,
)
# verifying that only four rows were dropped.
data.shape
(4743, 23)
Once again, let us observe the relationships between Number of Trips and both Designation and Monthly Income, and try to derive meaningful insights. We consider these variables because, as income increases, a lifestyle can comfortably accommodate more trips; it is also possible that as responsibilities increase, one needs to travel more for the company.
# isolate the rows where the NumberOfTrips are higher than the upper whisker
outlier_trips = data[(data.NumberOfTrips > 7)]
outlier_trips.shape
(106, 23)
# Creating crosstabs between Designation and income_bins
crosstb4 = pd.crosstab(outlier_trips.Designation, outlier_trips.income_bin)
# Creating barplot
barplot = crosstb4.plot.bar(rot=0)
plt.show()
outlier_trips["NumberOfTrips"].value_counts()
8.0     102
19.0      1
21.0      1
20.0      1
22.0      1
Name: NumberOfTrips, dtype: int64
We can treat the extreme values of the NumberOfTrips column, i.e., 19-22 trips, by dropping the rows with these values.
# dropping off the rows with values we identified as erroneous
data.drop(
data[
(data["NumberOfTrips"] == 19.0)
| (data["NumberOfTrips"] == 20.0)
| (data["NumberOfTrips"] == 21.0)
| (data["NumberOfTrips"] == 22.0)
].index,
inplace=True,
)
# verifying the shape of new dataset
data.shape
(4739, 23)
Let us take a look at the numerical columns and see if there is a need for scaling.
#isolating the numerical datatype columns, excluding the customer interaction data
nume_cols = [
"Age",
"NumberOfPersonVisiting",
"NumberOfTrips",
"MonthlyIncome",
"NumberOfChildrenVisiting",
]
plt.figure(figsize=(10,10))
for i, variable in enumerate(nume_cols):
plt.subplot(4, 2, i + 1)
sns.histplot(data=data, x=data[variable], kde=True)
plt.title(variable)
plt.tight_layout()
plt.show()
All the numerical columns are on different scales. Since outliers have been treated and the distributions are skewed, we can use MinMax scaling on these features
# perform MinMax scaling
data[
[
"Age",
"NumberOfPersonVisiting",
"NumberOfTrips",
"MonthlyIncome",
"NumberOfChildrenVisiting",
]
] = MinMaxScaler().fit_transform(
data[
[
"Age",
"NumberOfPersonVisiting",
"NumberOfTrips",
"MonthlyIncome",
"NumberOfChildrenVisiting",
]
]
)
plt.figure(figsize=(10, 10))
for i, variable in enumerate(nume_cols):
plt.subplot(4, 2, i + 1)
sns.histplot(data=data, x=data[variable], kde=True)
plt.title(variable)
plt.tight_layout()
plt.show()
Let us now start to build our model.
When we build the model, two kinds of outcomes can be wrongly predicted: a False Positive, where we predict a purchase that does not happen (wasted contact effort), and a False Negative, where we predict no purchase for a customer who would actually buy (a lost sale).
The stated objective of the business is to predict which customers are more likely to purchase the new product, so we need to identify as many potential customers as possible and reduce False Negatives. Therefore, we must aim at increasing recall.
Recall is the metric of interest here.
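To make this concrete, here is a minimal sketch (on hypothetical labels, not our dataset) of how recall relates to False Negatives: recall = TP / (TP + FN), so every missed purchaser directly lowers it.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# hypothetical labels: 1 = purchased, 0 = did not purchase
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 1, 0])  # misses two purchasers (FN = 2)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fn))                # 0.5 -- recall computed by hand
print(recall_score(y_true, y_pred))  # 0.5 -- same value via sklearn
```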
We will drop the customer interaction variables, since we will not have them for new customers. We will also drop the binned variables; they served their purpose in EDA and are no longer useful for model building.
data = data.drop(
[
"ProductPitched",
"PitchSatisfactionScore",
"NumberOfFollowups",
"DurationOfPitch",
"age_bin",
"trips_bin",
"duration_bin",
"income_bin",
],
axis=1,
)
data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 4739 entries, 0 to 4887 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ProdTaken 4739 non-null category 1 Age 4739 non-null float64 2 TypeofContact 4739 non-null category 3 CityTier 4739 non-null int64 4 Occupation 4739 non-null category 5 Gender 4739 non-null category 6 NumberOfPersonVisiting 4739 non-null float64 7 PreferredPropertyStar 4739 non-null category 8 MaritalStatus 4739 non-null category 9 NumberOfTrips 4739 non-null float64 10 Passport 4739 non-null category 11 OwnCar 4739 non-null category 12 NumberOfChildrenVisiting 4739 non-null float64 13 Designation 4739 non-null category 14 MonthlyIncome 4739 non-null float64 dtypes: category(9), float64(5), int64(1) memory usage: 431.2 KB
# creating dummy variables of all object type columns
# first dummy column is dropped
data = pd.get_dummies(
data,
columns=[
"TypeofContact",
"CityTier",
"Occupation",
"Gender",
"MaritalStatus",
"Passport",
"OwnCar",
"Designation",
"PreferredPropertyStar",
],
drop_first=True,
)
data.head()
| ProdTaken | Age | NumberOfPersonVisiting | NumberOfTrips | NumberOfChildrenVisiting | MonthlyIncome | TypeofContact_Self Enquiry | CityTier_2 | CityTier_3 | Occupation_Large Business | ... | MaritalStatus_Single | MaritalStatus_Unmarried | Passport_1 | OwnCar_1 | Designation_Executive | Designation_Manager | Designation_Senior Manager | Designation_VP | PreferredPropertyStar_4.0 | PreferredPropertyStar_5.0 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.534884 | 0.50 | 0.000000 | 0.000000 | 0.219869 | 1 | 0 | 1 | 0 | ... | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0.720930 | 0.50 | 0.142857 | 0.666667 | 0.181798 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 1 | 0.441860 | 0.50 | 0.857143 | 0.000000 | 0.047688 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0.348837 | 0.25 | 0.142857 | 0.333333 | 0.083819 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0.325581 | 0.25 | 0.000000 | 0.000000 | 0.108479 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
5 rows × 24 columns
We will begin with understanding the split of true and false cases in the dataset.
n_true = len(data.loc[data["ProdTaken"] == 1])
n_false = len(data.loc[data["ProdTaken"] == 0])
print(
"Number of true cases: {0} ({1:2.2f}%)".format(
n_true, (n_true / (n_true + n_false)) * 100
)
)
print(
"Number of false cases: {0} ({1:2.2f}%)".format(
n_false, (n_false / (n_true + n_false)) * 100
)
)
Number of true cases: 892 (18.82%) Number of false cases: 3847 (81.18%)
We can see that our dataset is highly imbalanced, with the false class present in 81% of cases. This means our models will tend to predict most cases as false, so accuracy may not be a good metric to rely on. During model building, we will take a few steps to offset this imbalance.
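One common way to offset such an imbalance (a sketch on hypothetical labels, using sklearn's compute_class_weight rather than any project-specific helper) is to weight classes by inverse frequency, so the minority class counts more in the loss. The resulting weight ratio matches the 81:19 class ratio, the same idea behind the {0: 0.19, 1: 0.81} weights used in the tuned models later.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# hypothetical labels mirroring an ~81%/19% imbalance
y = np.array([0] * 81 + [1] * 19)

# "balanced" = n_samples / (n_classes * class_count): inverse-frequency weighting
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], np.round(weights, 3))))  # minority class 1 gets the larger weight
```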
# creating dataframes of independent variables and dependent variables
X = data.drop("ProdTaken", axis=1)
y = data["ProdTaken"]
# Splitting data into training and test set:
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=1, stratify=y
)
# Let's check split of data
print(
"{0:0.2f}% data is in training set".format((len(X_train) / len(data.index)) * 100)
)
print("{0:0.2f}% data is in test set".format((len(X_test) / len(data.index)) * 100))
69.99% data is in training set 30.01% data is in test set
# printing the percentage of true values in the ProdTaken column in the original dataset
print(
"Original ProdTaken True Values : {0} ({1:0.2f}%)".format(
len(data.loc[data["ProdTaken"] == 1]),
(len(data.loc[data["ProdTaken"] == 1]) / len(data.index)) * 100,
)
)
# printing the percentage of false values in the ProdTaken column in the original dataset
print(
"Original ProdTaken False Values : {0} ({1:0.2f}%)".format(
len(data.loc[data["ProdTaken"] == 0]),
(len(data.loc[data["ProdTaken"] == 0]) / len(data.index)) * 100,
)
)
print("")
# printing the percentage of true values in the y_train dataset
print(
"Training ProdTaken True Values : {0} ({1:0.2f}%)".format(
len(y_train[y_train[:] == 1]),
(len(y_train[y_train[:] == 1]) / len(y_train)) * 100,
)
)
# printing the percentage of false values in the y_train dataset
print(
"Training ProdTaken False Values : {0} ({1:0.2f}%)".format(
len(y_train[y_train[:] == 0]),
(len(y_train[y_train[:] == 0]) / len(y_train)) * 100,
)
)
print("")
# printing the percentage of true values in the y_test dataset
print(
"Test ProdTaken True Values : {0} ({1:0.2f}%)".format(
len(y_test[y_test[:] == 1]), (len(y_test[y_test[:] == 1]) / len(y_test)) * 100
)
)
# printing the percentage of false values in the y_test dataset
print(
"Test ProdTaken False Values : {0} ({1:0.2f}%)".format(
len(y_test[y_test[:] == 0]), (len(y_test[y_test[:] == 0]) / len(y_test)) * 100
)
)
print("")
Original ProdTaken True Values : 892 (18.82%) Original ProdTaken False Values : 3847 (81.18%) Training ProdTaken True Values : 624 (18.81%) Training ProdTaken False Values : 2693 (81.19%) Test ProdTaken True Values : 268 (18.85%) Test ProdTaken False Values : 1154 (81.15%)
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
# defining a function to build the confusion matrix
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
# Fitting the model
d_tree = DecisionTreeClassifier(random_state=1)
d_tree.fit(X_train, y_train)
DecisionTreeClassifier(random_state=1)
# Calculating training performance
print("Training performance:")
dtree_model_train_perf = model_performance_classification_sklearn(
d_tree, X_train, y_train
)
dtree_model_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
# Calculating test performance
dtree_model_test_perf = model_performance_classification_sklearn(d_tree, X_test, y_test)
print("Testing performance:")
dtree_model_test_perf
Testing performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.871308 | 0.626866 | 0.669323 | 0.647399 |
# Creating confusion matrix
confusion_matrix_sklearn(d_tree, X_test, y_test)
from sklearn import tree
# printing the decision tree in a graphical format using plot_tree from sklearn
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
d_tree,
feature_names=list(X.columns),
filled=True,
fontsize=9,
node_ids=True,
class_names=True,
)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Fitting the model
rf_estimator = RandomForestClassifier(random_state=1)
rf_estimator.fit(X_train, y_train)
RandomForestClassifier(random_state=1)
# Calculating different metrics
rf_estimator_model_train_perf = model_performance_classification_sklearn(
rf_estimator, X_train, y_train
)
print("Training performance")
rf_estimator_model_train_perf
Training performance
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
rf_estimator_model_test_perf = model_performance_classification_sklearn(
rf_estimator, X_test, y_test
)
print("Testing performance:")
rf_estimator_model_test_perf
Testing performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.873418 | 0.421642 | 0.818841 | 0.55665 |
# Creating confusion matrix
confusion_matrix_sklearn(rf_estimator, X_test, y_test)
# Fitting the model
bagging_classifier = BaggingClassifier(random_state=1)
bagging_classifier.fit(X_train, y_train)
BaggingClassifier(random_state=1)
# Calculating training performance
bagging_classifier_model_train_perf = model_performance_classification_sklearn(
bagging_classifier, X_train, y_train
)
print("Training performance:")
bagging_classifier_model_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.990654 | 0.955128 | 0.994992 | 0.974652 |
# Calculating test performance
bagging_classifier_model_test_perf = model_performance_classification_sklearn(
bagging_classifier, X_test, y_test
)
print("Testing performance:")
bagging_classifier_model_test_perf
Testing performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.869198 | 0.496269 | 0.722826 | 0.588496 |
# Creating confusion matrix
confusion_matrix_sklearn(bagging_classifier, X_test, y_test)
# Choose a decision tree with inverse class weights
dtree_estimator = DecisionTreeClassifier(
class_weight={0: 0.19, 1: 0.81}, random_state=1
)
# Grid of parameters to choose from
parameters = {
"max_depth": (2, 6, 10, 12),
"min_samples_leaf": [5, 7, 10, 15],
"max_leaf_nodes": [2, 3, 5, 10, 15],
"min_impurity_decrease": [0.0001, 0.001, 0.01, 0.1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(dtree_estimator, parameters, scoring=scorer, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
dtree_estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
dtree_estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.19, 1: 0.81}, max_depth=6,
max_leaf_nodes=10, min_impurity_decrease=0.0001,
min_samples_leaf=5, random_state=1)
# Calculating training performance
dtree_estimator_model_train_perf = model_performance_classification_sklearn(
dtree_estimator, X_train, y_train
)
print("Training performance:")
dtree_estimator_model_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.806452 | 0.61859 | 0.488608 | 0.545969 |
# Calculating test performance
dtree_estimator_model_test_perf = model_performance_classification_sklearn(
dtree_estimator, X_test, y_test
)
print("Testing performance:")
dtree_estimator_model_test_perf
Testing performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.774965 | 0.570896 | 0.427374 | 0.488818 |
# Creating confusion matrix
confusion_matrix_sklearn(dtree_estimator, X_test, y_test)
# printing the decision tree in a graphical format using plot_tree from sklearn
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
dtree_estimator,
feature_names=list(X.columns),
filled=True,
fontsize=9,
node_ids=True,
class_names=True,
)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
Let's tune the Random Forest using inverse class weights.
# inversed class weights Random Forest Classifier
rf_tuned = RandomForestClassifier(
class_weight={0: 0.19, 1: 0.81}, random_state=1, oob_score=True, bootstrap=True
)
parameters = {
"max_depth": [4, 6, 8, 10, None],
"max_features": ["sqrt", "log2", None],
"n_estimators": [80, 90, 100, 110, 120],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(rf_tuned, parameters, scoring=scorer, cv=5, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
rf_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
rf_tuned.fit(X_train, y_train)
RandomForestClassifier(class_weight={0: 0.19, 1: 0.81}, max_depth=4,
max_features='sqrt', n_estimators=120, oob_score=True,
random_state=1)
# Calculating training performance
rf_tuned_model_train_perf = model_performance_classification_sklearn(
rf_tuned, X_train, y_train
)
print("Training performance:")
rf_tuned_model_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.807055 | 0.685897 | 0.490826 | 0.572193 |
# Calculating test performance
rf_tuned_model_test_perf = model_performance_classification_sklearn(
rf_tuned, X_test, y_test
)
print("Testing performance:")
rf_tuned_model_test_perf
Testing performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.774262 | 0.634328 | 0.43257 | 0.514372 |
# Creating confusion matrix
confusion_matrix_sklearn(rf_tuned, X_test, y_test)
Let's also tune the bagging classifier with inverse class weights.
# inversed class weight bagging classifier.
bagging_estimator_tuned = BaggingClassifier(
base_estimator=DecisionTreeClassifier(class_weight={0: 0.19, 1: 0.81}),
random_state=1,
)
# Grid of parameters to choose from
parameters = {
"max_samples": [0.7, 0.8, 0.9, 1],
"max_features": [0.7, 0.8, 0.9, 1],
"n_estimators": [10, 20, 30, 40, 50],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(bagging_estimator_tuned, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
bagging_estimator_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
bagging_estimator_tuned.fit(X_train, y_train)
BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight={0: 0.19,
1: 0.81}),
max_features=1, max_samples=0.7, random_state=1)
# Calculating train performance
bagging_estimator_tuned_model_train_perf = model_performance_classification_sklearn(
bagging_estimator_tuned, X_train, y_train
)
print("Training performance:")
bagging_estimator_tuned_model_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.593307 | 0.604167 | 0.254902 | 0.358535 |
# Calculating test performance
bagging_estimator_tuned_model_test_perf = model_performance_classification_sklearn(
bagging_estimator_tuned, X_test, y_test
)
print("Testing performance:")
bagging_estimator_tuned_model_test_perf
Testing performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.567511 | 0.548507 | 0.229329 | 0.323432 |
# Creating confusion matrix
confusion_matrix_sklearn(bagging_estimator_tuned, X_test, y_test)
# choosing Logistic Regression as the base estimator
bagging_lr = BaggingClassifier(
base_estimator=LogisticRegression(solver="liblinear", random_state=1),
random_state=1,
)
bagging_lr.fit(X_train, y_train)
BaggingClassifier(base_estimator=LogisticRegression(random_state=1,
solver='liblinear'),
random_state=1)
# Calculating train performance
bagging_lr_model_train_perf = model_performance_classification_sklearn(
bagging_lr, X_train, y_train
)
print("Training performance:")
bagging_lr_model_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.839916 | 0.278846 | 0.682353 | 0.395904 |
# Calculating test performance
bagging_lr_model_test_perf = model_performance_classification_sklearn(
bagging_lr, X_test, y_test
)
print("Testing performance:")
bagging_lr_model_test_perf
Testing performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.833333 | 0.242537 | 0.656566 | 0.354223 |
# Creating confusion matrix
confusion_matrix_sklearn(bagging_lr, X_test, y_test)
# Fitting the model
ab_classifier = AdaBoostClassifier(random_state=1)
ab_classifier.fit(X_train, y_train)
AdaBoostClassifier(random_state=1)
# Calculating train performance
ab_classifier_model_train_perf = model_performance_classification_sklearn(
ab_classifier, X_train, y_train
)
print("Training performance:")
ab_classifier_model_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.848055 | 0.307692 | 0.727273 | 0.432432 |
# Calculating test performance
ab_classifier_model_test_perf = model_performance_classification_sklearn(
ab_classifier, X_test, y_test
)
print("Testing performance:")
ab_classifier_model_test_perf
Testing performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.838256 | 0.264925 | 0.682692 | 0.38172 |
# Creating confusion matrix
confusion_matrix_sklearn(ab_classifier, X_test, y_test)
# Choose the type of classifier.
abc_tuned = AdaBoostClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
# Let's try different max_depth for base_estimator
"base_estimator": [
DecisionTreeClassifier(max_depth=6, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=4, random_state=1),
DecisionTreeClassifier(max_depth=5, random_state=1),
],
"n_estimators": [10, 110],
"learning_rate": [0.001, 0.002, 0.003, 0.004, 0.005],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(abc_tuned, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
abc_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
abc_tuned.fit(X_train, y_train)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=6,
random_state=1),
learning_rate=0.004, n_estimators=110, random_state=1)
# Calculating train performance
abc_tuned_model_train_perf = model_performance_classification_sklearn(
abc_tuned, X_train, y_train
)
print("Training performance:")
abc_tuned_model_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.924028 | 0.628205 | 0.951456 | 0.756757 |
# Calculating test performance
abc_tuned_model_test_perf = model_performance_classification_sklearn(
abc_tuned, X_test, y_test
)
print("Testing performance:")
abc_tuned_model_test_perf
Testing performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.855837 | 0.440299 | 0.682081 | 0.535147 |
# Creating confusion matrix
confusion_matrix_sklearn(abc_tuned, X_test, y_test)
# Fitting the model
gb_classifier = GradientBoostingClassifier(random_state=1)
gb_classifier.fit(X_train, y_train)
GradientBoostingClassifier(random_state=1)
# Calculating train performance
gb_classifier_model_train_perf = model_performance_classification_sklearn(
gb_classifier, X_train, y_train
)
print("Training performance:")
gb_classifier_model_train_perf
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.884836 | 0.474359 | 0.845714 | 0.607803 |
# Calculating test performance
gb_classifier_model_test_perf = model_performance_classification_sklearn(
gb_classifier, X_test, y_test
)
print("Testing performance:")
gb_classifier_model_test_perf
Testing performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.850211 | 0.373134 | 0.689655 | 0.484262 |
# Creating confusion matrix
confusion_matrix_sklearn(gb_classifier, X_test, y_test)
# We can tune the gradient boosting classifier, using an AdaBoost classifier as the initial estimator (the init parameter).
gbc_tuned = GradientBoostingClassifier(
init=AdaBoostClassifier(random_state=1), random_state=1
)
# Grid of parameters to choose from
parameters = {
"n_estimators": [20, 30, 50],
"subsample": [0.8, 0.9, 1],
"max_features": [0.5, 0.7, 1],
"learning_rate": [0.001, 0.004, 0.005],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(gbc_tuned, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
gbc_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
gbc_tuned.fit(X_train, y_train)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
learning_rate=0.004, max_features=0.7,
n_estimators=20, random_state=1, subsample=1)
# Calculating train performance
gbc_tuned_model_train_perf = model_performance_classification_sklearn(
gbc_tuned, X_train, y_train
)
print("Training performance:")
gbc_tuned_model_train_perf
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.845342 | 0.315705 | 0.696113 | 0.434399 |
# Calculating test performance
gbc_tuned_model_test_perf = model_performance_classification_sklearn(
gbc_tuned, X_test, y_test
)
print("Testing performance:")
gbc_tuned_model_test_perf
Testing performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.83052 | 0.264925 | 0.617391 | 0.370757 |
# Creating confusion matrix
confusion_matrix_sklearn(gbc_tuned, X_test, y_test)
# Fitting the model
xgb_classifier = XGBClassifier(random_state=1, eval_metric="logloss")
xgb_classifier.fit(X_train, y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric='logloss',
feature_types=None, gamma=None, gpu_id=None, grow_policy=None,
importance_type=None, interaction_constraints=None,
learning_rate=None, max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
max_leaves=None, min_child_weight=None, missing=nan,
monotone_constraints=None, n_estimators=100, n_jobs=None,
num_parallel_tree=None, predictor=None, random_state=1, ...)
# Calculating train performance
xgb_classifier_model_train_perf = model_performance_classification_sklearn(
xgb_classifier, X_train, y_train
)
print("Training performance:")
xgb_classifier_model_train_perf
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.995779 | 0.977564 | 1.0 | 0.988655 |
# Calculating test performance
xgb_classifier_model_test_perf = model_performance_classification_sklearn(
xgb_classifier, X_test, y_test
)
print("Testing performance:")
xgb_classifier_model_test_perf
Testing performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.892405 | 0.574627 | 0.797927 | 0.668113 |
# Creating confusion matrix
confusion_matrix_sklearn(xgb_classifier, X_test, y_test)
# choose the type of classifier
xgb_tuned = XGBClassifier(random_state=1, eval_metric="logloss", tree_method="hist")
# Grid of parameters to choose from
parameters = {
"n_estimators": [20, 30, 40, 50, 100],
"learning_rate": [0.001, 0.002, 0.003, 0.004, 0.005],
"scale_pos_weight": [5],
"subsample": [0.7, 0.9, 1],
}
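The fixed `scale_pos_weight` value of 5 is worth a note: it up-weights errors on the positive (purchasing) class, and is conventionally set near the ratio of negative to positive samples. A quick back-of-the-envelope check, assuming the ~18% purchase rate observed last year:

```python
# With roughly 18% positives, the negative-to-positive ratio is about
# 0.82 / 0.18 ≈ 4.56, so a scale_pos_weight of 5 is a reasonable round value.
pos_rate = 0.18
ratio = (1 - pos_rate) / pos_rate
print(round(ratio, 2))  # 4.56
```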
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(xgb_tuned, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
xgb_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
xgb_tuned.fit(X_train, y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric='logloss',
feature_types=None, gamma=None, gpu_id=None, grow_policy=None,
importance_type=None, interaction_constraints=None,
learning_rate=0.004, max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
max_leaves=None, min_child_weight=None, missing=nan,
monotone_constraints=None, n_estimators=100, n_jobs=None,
num_parallel_tree=None, predictor=None, random_state=1, ...)
# Calculating train performance
xgb_tuned_model_train_perf = model_performance_classification_sklearn(
xgb_tuned, X_train, y_train
)
print("Training performance:")
xgb_tuned_model_train_perf
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.828158 | 0.783654 | 0.529221 | 0.631783 |
# Calculating test performance
xgb_tuned_model_test_perf = model_performance_classification_sklearn(
xgb_tuned, X_test, y_test
)
print("Testing performance:")
xgb_tuned_model_test_perf
Testing performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.770745 | 0.664179 | 0.429952 | 0.521994 |
# Creating confusion matrix
confusion_matrix_sklearn(xgb_tuned, X_test, y_test)
# choose the base estimators
estimators = [
("Random Forest", rf_tuned),
("Gradient Boosting", gbc_tuned),
("Decision Tree", dtree_estimator),
("AdaBoost", abc_tuned),
]
# choose the final estimator
final_estimator = xgb_classifier
stacking_classifier = StackingClassifier(
estimators=estimators, final_estimator=final_estimator
)
# fitting the model
stacking_classifier.fit(X_train, y_train)
StackingClassifier(estimators=[('Random Forest',
RandomForestClassifier(class_weight={0: 0.19,
1: 0.81},
max_depth=4,
max_features='sqrt',
n_estimators=120,
oob_score=True,
random_state=1)),
('Gradient Boosting',
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
learning_rate=0.004,
max_features=0.7,
n_estimators=20,
random_state=1,
subsample=1)),
(...
gpu_id=None, grow_policy=None,
importance_type=None,
interaction_constraints=None,
learning_rate=None,
max_bin=None,
max_cat_threshold=None,
max_cat_to_onehot=None,
max_delta_step=None,
max_depth=None,
max_leaves=None,
min_child_weight=None,
missing=nan,
monotone_constraints=None,
n_estimators=100, n_jobs=None,
num_parallel_tree=None,
predictor=None, random_state=1, ...))
# Calculating train performance
stacking_classifier_model_train_perf = model_performance_classification_sklearn(
stacking_classifier, X_train, y_train
)
print("Training performance:")
stacking_classifier_model_train_perf
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.89388 | 0.588141 | 0.794372 | 0.675875 |
# Calculating test performance
stacking_classifier_model_test_perf = model_performance_classification_sklearn(
stacking_classifier, X_test, y_test
)
print("Testing performance:")
stacking_classifier_model_test_perf
Testing performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.847398 | 0.451493 | 0.633508 | 0.527233 |
# Creating confusion matrix
confusion_matrix_sklearn(stacking_classifier, X_test, y_test)
In the next section, we will compile and consolidate the performance of each model in order to compare them against one another.
# consolidate the training performance of all models into one dataframe
models_train_comp_df = pd.concat(
[
dtree_model_train_perf.T,
dtree_estimator_model_train_perf.T,
rf_estimator_model_train_perf.T,
rf_tuned_model_train_perf.T,
bagging_classifier_model_train_perf.T,
bagging_estimator_tuned_model_train_perf.T,
bagging_lr_model_train_perf.T,
ab_classifier_model_train_perf.T,
abc_tuned_model_train_perf.T,
gb_classifier_model_train_perf.T,
gbc_tuned_model_train_perf.T,
xgb_classifier_model_train_perf.T,
xgb_tuned_model_train_perf.T,
stacking_classifier_model_train_perf.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree",
"Decision Tree Estimator",
"Random Forest Estimator",
"Random Forest Tuned-Inverse",
"Bagging Classifier",
"Bagging Estimator Tuned-Inverse",
"Bagging Classifier-LogReg",
"Adaboost Classifier",
"Adaboost Classifier Tuned",
"Gradient Boost Classifier",
"Gradient Boost Classifier Tuned",
"XGBoost Classifier",
"XGBoost Classifier Tuned",
"Stacking Classifier",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Decision Tree | Decision Tree Estimator | Random Forest Estimator | Random Forest Tuned-Inverse | Bagging Classifier | Bagging Estimator Tuned-Inverse | Bagging Classifier-LogReg | Adaboost Classifier | Adaboost Classifier Tuned | Gradient Boost Classifier | Gradient Boost Classifier Tuned | XGBoost Classifier | XGBoost Classifier Tuned | Stacking Classifier |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 1.0 | 0.806452 | 1.0 | 0.807055 | 0.990654 | 0.593307 | 0.839916 | 0.848055 | 0.924028 | 0.884836 | 0.845342 | 0.995779 | 0.828158 | 0.893880 |
| Recall | 1.0 | 0.618590 | 1.0 | 0.685897 | 0.955128 | 0.604167 | 0.278846 | 0.307692 | 0.628205 | 0.474359 | 0.315705 | 0.977564 | 0.783654 | 0.588141 |
| Precision | 1.0 | 0.488608 | 1.0 | 0.490826 | 0.994992 | 0.254902 | 0.682353 | 0.727273 | 0.951456 | 0.845714 | 0.696113 | 1.000000 | 0.529221 | 0.794372 |
| F1 | 1.0 | 0.545969 | 1.0 | 0.572193 | 0.974652 | 0.358535 | 0.395904 | 0.432432 | 0.756757 | 0.607803 | 0.434399 | 0.988655 | 0.631783 | 0.675875 |
# consolidate the test performance of all models into one dataframe
models_test_comp_df = pd.concat(
[
dtree_model_test_perf.T,
dtree_estimator_model_test_perf.T,
rf_estimator_model_test_perf.T,
rf_tuned_model_test_perf.T,
bagging_classifier_model_test_perf.T,
bagging_estimator_tuned_model_test_perf.T,
bagging_lr_model_test_perf.T,
ab_classifier_model_test_perf.T,
abc_tuned_model_test_perf.T,
gb_classifier_model_test_perf.T,
gbc_tuned_model_test_perf.T,
xgb_classifier_model_test_perf.T,
xgb_tuned_model_test_perf.T,
stacking_classifier_model_test_perf.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Decision Tree",
"Decision Tree Estimator",
"Random Forest Estimator",
"Random Forest Tuned-Inverse",
"Bagging Classifier",
"Bagging Estimator Tuned-Inverse",
"Bagging Classifier-LogReg",
"Adaboost Classifier",
"Adaboost Classifier Tuned",
"Gradient Boost Classifier",
"Gradient Boost Classifier Tuned",
"XGBoost Classifier",
"XGBoost Classifier Tuned",
"Stacking Classifier",
]
print("Testing performance comparison:")
models_test_comp_df
Testing performance comparison:
| | Decision Tree | Decision Tree Estimator | Random Forest Estimator | Random Forest Tuned-Inverse | Bagging Classifier | Bagging Estimator Tuned-Inverse | Bagging Classifier-LogReg | Adaboost Classifier | Adaboost Classifier Tuned | Gradient Boost Classifier | Gradient Boost Classifier Tuned | XGBoost Classifier | XGBoost Classifier Tuned | Stacking Classifier |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.871308 | 0.774965 | 0.873418 | 0.774262 | 0.869198 | 0.567511 | 0.833333 | 0.838256 | 0.855837 | 0.850211 | 0.830520 | 0.892405 | 0.770745 | 0.847398 |
| Recall | 0.626866 | 0.570896 | 0.421642 | 0.634328 | 0.496269 | 0.548507 | 0.242537 | 0.264925 | 0.440299 | 0.373134 | 0.264925 | 0.574627 | 0.664179 | 0.451493 |
| Precision | 0.669323 | 0.427374 | 0.818841 | 0.432570 | 0.722826 | 0.229329 | 0.656566 | 0.682692 | 0.682081 | 0.689655 | 0.617391 | 0.797927 | 0.429952 | 0.633508 |
| F1 | 0.647399 | 0.488818 | 0.556650 | 0.514372 | 0.588496 | 0.323432 | 0.354223 | 0.381720 | 0.535147 | 0.484262 | 0.370757 | 0.668113 | 0.521994 | 0.527233 |
# bring the train and test dataframes into a single dataframe for a side-by-side comparison
# rename the columns of the train dataset and transpose
models_train_comp_df = models_train_comp_df.rename(
index={
"Accuracy": "Accuracy_train",
"Recall": "Recall_train",
"Precision": "Precision_train",
"F1": "F1_train",
}
)
models_train_comp_df = models_train_comp_df.T
# rename the columns of the test dataset and transpose
models_test_comp_df = models_test_comp_df.rename(
index={
"Accuracy": "Accuracy_test",
"Recall": "Recall_test",
"Precision": "Precision_test",
"F1": "F1_test",
}
)
models_test_comp_df = models_test_comp_df.T
# concatenate both the above datasets into a single one.
final_model_comp = pd.concat([models_train_comp_df, models_test_comp_df], axis=1)
final_model_comp = final_model_comp[
[
"Accuracy_train",
"Accuracy_test",
"Recall_train",
"Recall_test",
"Precision_train",
"Precision_test",
"F1_train",
"F1_test",
]
]
final_model_comp
| | Accuracy_train | Accuracy_test | Recall_train | Recall_test | Precision_train | Precision_test | F1_train | F1_test |
|---|---|---|---|---|---|---|---|---|
| Decision Tree | 1.000000 | 0.871308 | 1.000000 | 0.626866 | 1.000000 | 0.669323 | 1.000000 | 0.647399 |
| Decision Tree Estimator | 0.806452 | 0.774965 | 0.618590 | 0.570896 | 0.488608 | 0.427374 | 0.545969 | 0.488818 |
| Random Forest Estimator | 1.000000 | 0.873418 | 1.000000 | 0.421642 | 1.000000 | 0.818841 | 1.000000 | 0.556650 |
| Random Forest Tuned-Inverse | 0.807055 | 0.774262 | 0.685897 | 0.634328 | 0.490826 | 0.432570 | 0.572193 | 0.514372 |
| Bagging Classifier | 0.990654 | 0.869198 | 0.955128 | 0.496269 | 0.994992 | 0.722826 | 0.974652 | 0.588496 |
| Bagging Estimator Tuned-Inverse | 0.593307 | 0.567511 | 0.604167 | 0.548507 | 0.254902 | 0.229329 | 0.358535 | 0.323432 |
| Bagging Classifier-LogReg | 0.839916 | 0.833333 | 0.278846 | 0.242537 | 0.682353 | 0.656566 | 0.395904 | 0.354223 |
| Adaboost Classifier | 0.848055 | 0.838256 | 0.307692 | 0.264925 | 0.727273 | 0.682692 | 0.432432 | 0.381720 |
| Adaboost Classifier Tuned | 0.924028 | 0.855837 | 0.628205 | 0.440299 | 0.951456 | 0.682081 | 0.756757 | 0.535147 |
| Gradient Boost Classifier | 0.884836 | 0.850211 | 0.474359 | 0.373134 | 0.845714 | 0.689655 | 0.607803 | 0.484262 |
| Gradient Boost Classifier Tuned | 0.845342 | 0.830520 | 0.315705 | 0.264925 | 0.696113 | 0.617391 | 0.434399 | 0.370757 |
| XGBoost Classifier | 0.995779 | 0.892405 | 0.977564 | 0.574627 | 1.000000 | 0.797927 | 0.988655 | 0.668113 |
| XGBoost Classifier Tuned | 0.828158 | 0.770745 | 0.783654 | 0.664179 | 0.529221 | 0.429952 | 0.631783 | 0.521994 |
| Stacking Classifier | 0.893880 | 0.847398 | 0.588141 | 0.451493 | 0.794372 | 0.633508 | 0.675875 | 0.527233 |
feature_names = X_train.columns
importances = rf_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
We have extensively analysed the data and built several models based on ensemble techniques. Before we make recommendations on which model to use, let us look at a summary of the work we've done.
When a tourism company like Visit With Us wants to attract more customers, the prediction model should correctly identify the customers who will actually purchase. Missing a potential customer means an opportunity loss and lost revenue, so we evaluated our models with the objective of reducing false negatives as much as possible, i.e. of achieving a good recall score.
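To make the metric choice concrete, here is a tiny illustration (hypothetical labels) of how each missed buyer, i.e. each false negative, directly lowers recall:

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 4 actual buyers, of whom the model finds only 2
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]  # 2 TP, 2 FN, 1 FP
print(recall_score(y_true, y_pred))  # 0.5 -- each FN reduces recall
print(round(precision_score(y_true, y_pred), 3))  # 0.667 -- driven by the FP
```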
We built 14 models in total; they are listed side by side in the comparison table above.
The model that gave us the best results was the tuned Random Forest with inverse class weights.
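The "inverse" in the model's name refers to its class weights: judging by the class_weight={0: 0.19, 1: 0.81} visible in the fitted estimator, each class is weighted by (approximately) the other class's frequency, so the minority purchasing class carries the larger weight. A minimal sketch of that idea, using toy labels:

```python
import numpy as np

# Toy labels with ~19% positives, mimicking the observed purchase rate
y = np.array([0] * 81 + [1] * 19)
pos_rate = float(y.mean())
# Inverse weighting: each class is weighted by the other class's frequency
class_weight = {0: round(pos_rate, 2), 1: round(1 - pos_rate, 2)}
print(class_weight)  # {0: 0.19, 1: 0.81}
```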
# Creating confusion matrix
confusion_matrix_sklearn(rf_tuned, X_test, y_test)
True Positives (TP): the model correctly predicted that 170 customers will purchase products from the company
True Negatives (TN): the model correctly predicted that 931 customers will not purchase products from the company
False Positives (FP): the model incorrectly predicted that 223 customers will purchase products from the company (a "Type I error")
False Negatives (FN): the model incorrectly predicted that 98 customers will not purchase products from the company (a "Type II error")
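These counts are consistent with the comparison table: recomputing the metrics from them reproduces the Random Forest Tuned-Inverse test row (a quick sanity check):

```python
# Metrics recomputed from the confusion-matrix counts above
TP, TN, FP, FN = 170, 931, 223, 98
accuracy = (TP + TN) / (TP + TN + FP + FN)          # ≈ 0.7743
recall = TP / (TP + FN)                             # ≈ 0.6343
precision = TP / (TP + FP)                          # ≈ 0.4326
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.5144
print(round(accuracy, 4), round(recall, 4), round(precision, 4), round(f1, 4))
```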
The primary target group for the company should be:
Basic is the most popular product, followed by Deluxe, Standard, King, and Super Deluxe, in that order. Basic and Deluxe should therefore be marketed as entry-level packages with affordable and attractive facilities.
Note: Interaction parameters are more or less the same across the products. Most products have a pitch duration of short to regular. Customer satisfaction is mostly average across products, though ratings range anywhere from 1 to 5. Most products require 3-4 follow-ups, except Super Deluxe, which requires only 1-2.